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DETAILED ACTION 
Response to Amendment 

1 . In response to the Office Action mailed February 13, 2007, applicant submitted 
an amendment filed on May 17, 2007, in which the applicant traversed and requested 
reconsideration. 

Response to Arguments 

2. Applicants argue that Maali does not teach using audio and video signals to train 
a time delay neural network to determine when a person is speaking wherein the audio 
feature is the energy over an audio frame and correlating said audio features and video 
features to determine when a person is speaking. Maali teaches correlating said audio 
features and video features to determine when a person is speaking (column 3, lines 
51-60). If Maali identifies the speakers and possible frames indicating a speaker 
change, then it is inherent that it determined when that person is speaking. Applicant 
further argues that Maali does not train any type of Neural Network. However, as stated 
in the office action, Stork teaches that portion in combination with Maali wherein Stork 
teaches that the Neural Network is trained to recognize (column 4, lines 46-64). Also, in 
response to applicant's arguments against the references individually, one cannot show 
nonobviousness by attacking references individually where the rejections are based on 
combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 
1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). 

In response to applicant's argument that there is no suggestion to combine the 
references, the examiner recognizes that obviousness can only be established by 
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combining or modifying the teachings of the prior art to produce the claimed invention 
where there is some teaching, suggestion, or motivation to do so found either in the 
references themselves or in the knowledge generally available to one of ordinary skill in 
the art. See In re Fine, 837 F.2d 1071, 5 USPQ2d 1596 (Fed. Cir. 1988)and In re 
Jones, 958 F.2d 347, 21 USPQ2d 1941 (Fed. Cir. 1992). In this case, all of the 
references have been motivated by various cited portions of the reference as shown 
below. Besides Teaching, Suggestion and Motivation is not the only rationale to 
establish a prima facie case of obviousness, but only one of the available rationales. 
Rationales for arriving at a conclusion of obviousness suggested by the Supreme 
Court's decision include combining prior art elements according to known methods to 
yield predictable results, simple substitutions of one known element for another to 
obtain predictable results, use of known techniques to a known device ready for 
improvement to yield predictable results, "obvious to try"-choosing from a finite number 
of identified, predictable solutions, with a reasonable expectation of success and known 
work in one field of endeavor may prompt variations of it for use in either the same field 
or a different one based on design incentives or other market forces if the variations are 
predictable to one of ordinary skill in the art. 

Applicants further argue that Nefian does not teach reducing the noise of the 
audio signal is during preprocessing. However, in response to applicant's arguments, 
the recitation during preprocessing has not been given patentable weight because the 
recitation occurs in the preamble. A preamble is generally not accorded any patentable 
weight where it merely recites the purpose of a process or the intended use of a 
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structure, and where the body of the claim does not depend on the preamble for 
completeness but, instead, the process steps or structural limitations are able to stand 
alone. See In re Hirao, 535 F.2d 67, 190 USPQ 15 (CCPA 1976) and Kropa v. Robie, 
187 F.2d 150, 152, 88 USPQ 478, 481 (CCPA 1951). 

Applicants further argue that Stork does teach wherein the audio feature is the 
energy over an audio frame and wherein said video feature is the openness of a 
person's mouth. However, Stork teaches comparing voice sound and mouth shape 
(column 10, lines 5-35). Therefore, Applicant's arguments are not persuasive. 

Further Applicants argue that Stork teaches training of a neural network to 
recognize what is said not when someone is talking. However, if Stork knows what is 
said, it is inherent that the system knows when someone is talking. Therefore, 
Applicant's arguments are not persuasive. 

Claim Rejections - 35 USC § 103 

3. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 1 02 of this title, if the differences between the subject matter sought to be patented and 
the prior art are such that the subject matter as a whole would have been obvious at the time the 
invention was made to a person having ordinary skill in the art to which said subject matter pertains. 
Patentability shall not be negatived by the manner in which the invention was made. 

4. Claims 1-2, 9-10, 12, 24, 28 and 31 are rejected under 35 U.S.C. 103(a) as 
being unpatentable over Maali et al. (USPN 6,567,775), hereinafter referenced as Maali 
in view of Stork (USPN 5,586,215), hereinafter referenced as Stork. 
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Regarding claim 1, Maali discloses a computer-implemented process for 
detecting speech, comprising the process actions of: 

inputting associated audio and video training data containing a person's face that 
is periodically speaking (audio-video; column 3, lines 51-60); 

wherein said training comprises the following process actions: 

computing audio features from said audio training data wherein said audio 
feature is the energy over an audio frame (column 3, lines 51-60 with column 6, lines 5- 
24); 

computing video features from said video training signals wherein said video 
feature is the degree to which said person's mouth is open or closed (lip movements; 
column 1, lines 47-54); and 

correlating said audio features and video features to determine when a person is 
speaking (column 3, lines 51-60), but does not specifically teach using said audio and 
video signals to train a time delay neural network to determine when a person is 
speaking. 

Stork teaches an acoustic and visual speech recognition system using said audio 
and video signals to train a time delay neural network to determine when a person is 
speaking (column 4, lines 46-64 with column 7, lines 22-29), in order to accommodate 
utterances that may be of variable length, as well as somewhat unpredictable in the 
time of utterance onset. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Maali's process wherein it uses said audio and 
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video signals to train a time delay neural network to determine when a person is 
speaking, as taught by Stork, to obtain the most probable utterance (column 4, lines 47- 
61). 

Regarding claim 2, Maali discloses a computer-implemented process for 
detecting speech in an audio-visual sequence, but does not specifically teach a process 
further comprising the process action of preprocessing the audio and video signals prior 
to using said audio and video signals to train a Time Delay Neural Network. 

Stork teaches an acoustic and visual speech recognition system wherein a 
process further comprising the process action of preprocessing the audio and video 
signals prior to using said audio and video signals to train a Time Delay Neural Network 
(column 4, lines 46-64 with column 7, lines 22-29), in order to accommodate utterances 
that may be of variable length, as well as somewhat unpredictable in the time of 
utterance onset. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Maali's process teach wherein a process further 
comprising the process action of preprocessing the audio and video signals prior to 
using said audio and video signals to train a Time Delay Neural Network, as taught by 
Stork, to obtain the most probable utterance (column 4, lines 47-61). 

Regarding claim 9, Maali discloses a process further comprising the process 
actions of: 

inputting an associated audio and video sequence of a person periodically 
speaking (column 3, lines 51-60), but does not specifically teach using said trained Time 
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Delay Neural Network to determine when in said audio and video sequence said person 
is speaking. 

Stork teaches an acoustic and visual speech recognition system using said 
trained Time Delay Neural Network to determine when in said audio and video 
sequence said person is speaking (column 4, lines 46-64 with column 7, lines 22-29), in 
order to accommodate utterances that may be of variable length, as well as somewhat 
unpredictable in the time of utterance onset. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Maali's process wherein it uses said trained 
Time Delay Neural Network to determine when in said audio and video sequence said 
person is speaking, as taught by Stork, to obtain the most probable utterance (column 
4, lines 47-61). 

Regarding claim 10, Maali discloses a computer-implemented process for 
detecting speech in an audio-visual sequence, but does not specifically teach the 
process action of preprocessing the associated audio and video sequence prior to using 
said trained Time Delay Neural Network to determine if a person is speaking. 

Stork teaches an acoustic and visual speech recognition system wherein the 
process action of preprocessing the associated audio and video sequence prior to using 
said trained Time Delay Neural Network to determine if a person is speaking (column 4, 
lines 46-64 with column 7, lines 22-29), in order to accommodate utterances that may 
be of variable length, as well as somewhat unpredictable in the time of utterance onset. 
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Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Maali's process teach wherein a process further 
comprising the process action of preprocessing the associated audio and video 
sequence prior to using said trained Time Delay Neural Network to determine if a 
person is speaking, as taught by Stork, to obtain the most probable utterance (column 
4, lines 47-61). 

Regarding claim 12, Maali discloses a computer-readable memory containing a 
computer program that is executable by a computer to perform the process (column 4, 
lines 9-22). 

Regarding claim 24, Maali discloses a computer-implemented process for 
detecting speech in an audio-visual sequence wherein more than one person is 
speaking at a time, comprising the process actions of: 

inputting associated audio and video training data containing more than one 
person's face wherein each person is periodically speaking at the same time as the 
other person or persons (column 4, line 51 - column 4, lines 8); and 

wherein said training comprises the following process actions: 

computing audio features from said audio training data wherein said audio 
feature is the energy over an audio frame (column 3, lines 51-60 with column 6, lines 5- 
24); 

computing video features from said video training signals to determine whether a 
given person's mouth is open or closed(lip movements; column 1, lines 47-54); and 
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correlating said audio features and video features to determine when a given 
person is speaking (column 3, lines 51-61). 

Stork teaches an acoustic and visual speech recognition system using said audio 
and video signals to train a time delay neural network to determine which person is 
speaking at a given time (column 4, lines 46-64 with column 7, lines 22-29), in order to 
accommodate utterances that may be of variable length, as well as somewhat 
unpredictable in the time of utterance onset. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Maali's process wherein it uses said audio and 
video signals to train a time delay neural network to determine which person is speaking 
at a given time, as taught by Stork, to obtain the most probable utterance (column 4, 
lines 47-61). 

Regarding claim 28, Maali discloses a computer-implemented process for 
detecting speech, comprising the process actions of: 

inputting associated audio and video training data containing a person's face that 
is periodically speaking (column 3, lines 51-60); and 

wherein said training comprises the following process actions: 

computing audio features from said audio training (column 3, lines 51-60); 

computing video features from said video training signals wherein said video 
feature is the degree to which said person's mouth is open or closed (column 1, lines 
47-54); and 
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correlating said audio features and video features to determine when a person is 
speaking (column 3, lines 51-60). 

Stork teaches an acoustic and visual speech recognition system using said audio 
and video signals to train a statistical learning engine to determine when a person is 
speaking (column 4, lines 46-64 with column 7, lines 22-29), in order to accommodate 
utterances that may be of variable length, as well as somewhat unpredictable in the 
time of utterance onset. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Maali's process wherein it uses said audio and 
video signals to train a statistical learning engine to determine when a person is 
speaking, as taught by Stork, to obtain the most probable utterance (column 4, lines 47- 
61). 

Regarding claim 31, Maali discloses a computer-implemented process for 
detecting speech in an audio-visual sequence, but does not specifically teach wherein 
said statistical learning engine is a Time Delay Neural Network. 

Stork teaches an acoustic and visual speech recognition system wherein said 
statistical learning engine is a Time Delay Neural Network (column 4, lines 46-64 with 
column 7, lines 22-29), in order to accommodate utterances that may be of variable 
length, as well as somewhat unpredictable in the time of utterance onset. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Maali's process teach wherein said statistical 
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learning engine is a Time Delay Neural Network, as taught by Stork, to obtain the most 
probable utterance (column 4, lines 47-61). 

5. Claims 6-7 are rejected under 35 U.S.C. 103(a) as being unpatentable over 
Maali in view of Stork, as applied to claim 1, and in further view of Liang et al. (PGPUB 
2003/0212552), hereinafter referenced as Liang. 

Regarding claim 6, Maali discloses a process wherein the process action of 
computing video features from said video training signals comprises the process actions 
of: 

using a face detector to locate a face in said video training signals (face detector; 
column 4, lines 1-8) and 

using the geometry of a typical face to estimate the location of a mouth and 
extracting a mouth image (column 1 1 , lines 59-66), but does not specifically teach 
stabilizing the mouth, using Linear Discriminant Analysis and designating values for the 
mouth. 

Stork discloses an acoustic and visual speech recognition system comprising: 
stabilizing the mouth image to remove any translational motion of the mouth 

caused by head movement (mouth rested; column 5, lines 37-61) and 

designating values of mouth openness wherein the values range from -1 for the 

mouth being closed, to +1 for the mouth being open (assigned a value plus or minus; 

column 5, lines 37-61), to convey the essential visual information. 
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Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Maali's process wherein it stabilizes the mouth 
and assigns values to the mouth, as taught by Stork, to result in a different, but 
effective, dynamic visual data vector (column 5, lines 51-61). 

Maali in view Stork disclose a computer-implemented process for detecting 
speech, but does not specifically teach using a Linear Discriminant Analysis (LDA) 
projection to determine if the mouth in the segmented mouth image is open or closed. 

Liang discloses audiovisual speech recognition using a Linear Discriminant 
Analysis (LDA) projection to determine if the mouth in the segmented mouth image is 
open or closed (column 2, paragraph 0015), to assign pixels in the mouth region to the 
lip and face classes. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Maali in view of Stork's process, wherein it uses 
a Linear Discriminant Analysis (LDA) projection to determine if the mouth in the 
segmented mouth image is open or closed, as taught by Liang, to find the best 
discrimination between the classes (column 2, paragraph 0015). 

Regarding claim 7, Maali discloses a process for detecting speech, but does not 
teach a process wherein the process action of stabilizing the mouth image comprises 
the process action of using normalized cross correlation to remove any of said 
translational movement. 

Stork discloses an acoustic and visual speech recognition system wherein the 
process action of stabilizing the mouth image (mouth rested position) comprises the 
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process action of using normalized (normalization) cross correlation to remove any of 
said translational movement (column 5, lines 37-62), to convey the essential visual 
information. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Maali's process wherein the process action of 
stabilizing the mouth image comprises the process action of using normalized cross 
correlation to remove any of said translational movement, as taught by Stork, to result in 
a different, but effective, dynamic visual data vector (column 5, lines 51-61). 

6. Claims 3-5 and 11, are rejected under 35 U.S.C. 103(a) as being unpatentable 
over Maali, in view of Stork, as applied to claim 2 above, and in further view of Nefian et 
al. (PGPUB 2004/0122675), hereinafter referenced as Nefian. 

Regarding claim 3, Maali in view of Stork disclose a process wherein said 
process action of preprocessing the audio and video signals comprises the process 
actions of: 

segmenting the audio data signals (Stork; segment audio; column 4, lines 57-65); 
segmenting the video data signals (Stork; segment video; column 4, lines 57-65); 
extracting audio features (Stork; extract audio; column 6, lines 5-24); and 
extracting video features (Stork; extract video; column 6, lines 5-24), but does not 
specifically teach reducing the noise of the audio signals. 
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extracting audio features from said sequence (Stork; extract audio; column 6, 
lines 5-24); and 

extracting video features from said sequence (Stork; extract video; column 6, 
lines 5-24), but does not specifically teach reducing the noise of the audio signals in 
said sequence. 

Nefian discloses audiovisual continuous speech recognition system reducing the 
noise of the audio signals in said sequence (column 3, paragraph 0026), to increase the 
recognition rate. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Maali in view of Stork's process, wherein it 
reduces the noise of the audio signals in said sequence, as taught by Nefian, to reduce 
the parameter space and overall complexity (column 3, paragraph 0026). 

7. Claims 13-15, 20-22 are rejected under 35 U.S.C. 103(a) as being unpatentable 
over Bakis et al. (USPN 6,219,639), hereinafter referenced as Bakis in view of Stork. 

Regarding claim 13, Bakis discloses a computer-readable medium having 
computer-executable instructions for use in detecting when a person in a synchronized 
audio video clip is speaking, said computer executable instructions comprising: 

inputting one or more captured video and synchronized audio clips (synchronize 
lip movement with speech; column 2, lines 21-62), 
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Nefian discloses audiovisual continuous speech recognition system reducing the 
noise of the audio signals (column 3, paragraph 0026), to increase the recognition rate. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Maali in view of Stork's process, wherein it 
reduces the noise of the audio signals, as taught by Nefian, to reduce the parameter 
space and overall complexity (column 3, paragraph 0026). 

Regarding claim 4, Maali discloses an audiovisual speech recognition process 
wherein the process action of segmenting the audio data signal comprises the process 
action of segmenting the audio data to determine regions of speech and non -speech 
(column 6, line 49 - column 7, line 54 with column 10, lines 58-65). 

Regarding claim 5, Maali discloses a process wherein the process action of 
segmenting the video data signal comprises the process action of segmenting the video 
data to determine at least one face and a mouth region within said determined faces 
(column 1 1 , line 59 - column 1 2, line 12). 

Regarding claim 11, Maali in view of Stork disclose a process wherein said 
process action of preprocessing the audio and video sequence comprises the process 
actions of: 

segmenting the audio data in said sequence (Stork; segment audio; column 4, 
lines 57-65); 

segmenting the video data signals in said sequence (Stork; segment video; 
column 4, lines 57-65); 
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segmenting (segment) said audio and video clips to remove portions of said 
video and synchronized (synchronize) audio clips not needed in determining if a 
speaker in the captured video and synchronized audio clips is speaking (column 4, lines 
11-67); 

extracting audio and video features in said captured video and synchronized 
audio clips to be used in determining if a speaker in the captured (extracted attribute; 
abstract with column 4, lines 10-67); and wherein an audio feature is the energy over an 
audio frame and wherein said video feature is the openness of a person's mouth 
(column 10, lines 5-35), but does not specifically teach training a Time Delay Neural 
Network to determine when a person is speaking using said extracted audio and video 
features. 

Stork teaches an acoustic and visual speech recognition system training a Time 
Delay Neural Network to determine when a person is speaking using said extracted 
audio and video features (column 4, lines 46-64 with column 7, lines 22-29), in order to 
accommodate utterances that may be of variable length, as well as somewhat 
unpredictable in the time of utterance onset. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Bakis' medium to train a Time Delay Neural 
Network to determine when a person is speaking using said extracted audio and video 
features, as taught by Stork, to obtain the most probable utterance (column 4, lines 47- 
61). 
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Regarding claim 14, Bakis discloses a medium for detecting speech, but does 
not specifically teach wherein the instruction for training a Time Delay Neural Network 
further comprises a sub-instruction for correlating said audio features and video features 
to determine when a person is speaking. 

Stork teaches an acoustic and visual speech recognition system wherein the 
instruction for training a Time Delay Neural Network further comprises a sub-instruction 
for correlating said audio features and video features to determine when a person is 
speaking (column 4, lines 46-64 with column 7, lines 22-29), in order to accommodate 
utterances that may be of variable length, as well as somewhat unpredictable in the 
time of utterance onset. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Bakis' medium wherein the instruction for 
training a Time Delay Neural Network further comprises a sub-instruction for correlating 
said audio features and video features to determine when a person is speaking, as 
taught by Stork, to obtain the most probable utterance (column 4, lines 47-61). 

Regarding claim 15, Bakis discloses the computer-readable medium further 
comprising instructions for: 

inputting a captured video and synchronized audio clip for which it is desired to 
detect a person speaking (column 4, lines 10-67), but does not specifically teach using 
said trained Time Delay Neural Network to determine when a person is speaking in the 
captured video and synchronized audio clip for which it is desired to detect a person 
speaking by using said extracted audio and video features. 
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Stork teaches an acoustic and visual speech recognition system using said 
trained Time Delay Neural Network to determine when a person is speaking in the 
captured video and synchronized audio clip for which it is desired to detect a person 
speaking by using said extracted audio and video features (column 4, lines 46-64 with 
column 7, lines 22-29), in order to accommodate utterances that may be of variable 
length, as well as somewhat unpredictable in the time of utterance onset. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Bakis' medium wherein it uses said trained Time 
Delay Neural Network to determine when a person is speaking in the captured video 
and synchronized audio clip for which it is desired to detect a person speaking by using 
said extracted audio and video features, as taught by Stork, to obtain the most probable 
utterance (column 4, lines 47-61). 

Regarding claim 20, Bakis discloses a system for detecting a speaker in a video 
segment that is synchronized with associated audio, the system comprising: 

a general purpose computing device (column 10, lines 5-35); and 

a computer program comprising program modules executable by the computing 
device, wherein the computing device is directed by the program modules of the 
computer program to (column 10, lines 5-35), 

input one or more captured video and synchronized audio segments (column 2, 
lines 21-47 with column 4, lines 10-67), 



Application/Control Number: 10/606,061 Page 19 

Art Unit: 2626 

segment said audio and video segments to remove portions of said video and 
synchronized audio segments not needed in determining if a speaker in the captured 
video and synchronized audio segments is speaking (column 4, lines 10-67); 

extract audio and video features in said captured video and synchronized audio 
segments to be used in determining if a speaker in the captured video and synchronized 
audio segments is speaking, wherein said audio feature is the energy over an audio 
frame and said video feature is the openness of a person's mouth in said video and 
synchronized audio segments (column 4, lines 10-67); and 

input a captured video and synchronized audio clip for which it is desired to 
detect a person speaking (column 4, lines 10-67), but does not specifically teach 
training a TDNN. 

Stork teaches an acoustic and visual speech recognition system training a Time 
Delay Neural Network to determine when a person is speaking using said extracted 
audio and video features and use said trained Time Delay Neural Network to determine 
when a person is speaking in the captured video and synchronized audio segments for 
which it is desired to detect a person speaking (column 4, lines 46-64 with column 7, 
lines 22-29), in order to accommodate utterances that may be of variable length, as well 
as somewhat unpredictable in the time of utterance onset. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Bakis' medium wherein it trains a Time Delay 
Neural Network to determine when a person is speaking using said extracted audio and 
video features and use said trained Time Delay Neural Network to determine when a 
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person is speaking in the captured video and synchronized audio segments for which it 
is desired to detect a person speaking, as taught by Stork, to obtain the most probable 
utterance (column 4, lines 47-61). 

Regarding claim 21, Bakis discloses a system wherein it outputs a 1 when a 
person is talking for each frame in said captured video and synchronized audio 
segments for which it is desired to detect a person speaking, and outputs a 0 when no 
person is talking (column 12, lines 12-65), but does not specifically teach using Time 
Delay Neural Network to train. 

Stork teaches an acoustic and visual speech recognition system using Time 
Delay Neural Network to train (column 4, lines 46-64 with column 7, lines 22-29), in 
order to accommodate utterances that may be of variable length, as well as somewhat 
unpredictable in the time of utterance onset. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Bakis' medium wherein it using Time Delay 
Neural Network to train, as taught by Stork, to obtain the most probable utterance 
(column 4, lines 47-61). 

Regarding claim 22, Bakis discloses a system wherein said Time Delay Neural 
Network comprises: 

one output, wherein said output is set to 0 when no person in the video and 
synchronized audio segment is speaking; and wherein said output is set to 1 when a 
person in the video and synchronized audio segment is speaking (column 12, lines 12- 
65), but does not specifically teach an input layer and two hidden layers 
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Stork discloses an acoustic and visual recognition system comprising an input 
layer (column 8, lines 24-32) and two hidden layers (column 15, lines 12-37), in order to 
accommodate utterances that may be of variable length. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Bakis' system wherein it comprises an input 
layer and two hidden layers, as taught by Stork, to enhance understanding (column 1, 
lines 36-51). 

8. Claim 16 is rejected under 35 U.S.C. 103(a) as being unpatentable over Bakis in 
view of Stork, as applied to claim 13 above, and in further view of Nefian. 

Regarding claim 16, Bakis in view of Stork disclose the computer-readable 
medium for detecting speech, but does not specifically teach a medium further 
comprising an instruction for reducing noise in said audio video clips prior to said 
instruction for segmenting said audio and video clips. 

Nefian discloses audiovisual continuous speech recognition system reducing 
noise in said audio video clips prior to said instruction for segmenting said audio and 
video clips (column 3, paragraph 0026), to increase the recognition rate. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Bakis and view of Stork's process, wherein it 
reduces noise in said audio video clips prior to said instruction for segmenting said 
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audio and video clips, as taught by Nefian, to reduce the parameter space and overall 
complexity (column 3, paragraph 0026). 

9. Claims 17-18 and 23 are rejected under 35 U.S.C. 103(a) as being unpatentable 
over Bakis in view of Stork and in further view of Liang. 

Regarding claim 17, Bakis discloses a process wherein the process action of 
computing video features from said video training signals comprises the process actions 
of: 

using a face detector to locate a face in said video training signals (column 2, 
lines 21-47 with column 4, lines 10-67) and 

using the geometry of a typical face to estimate the location of a mouth and 
extracting a mouth image (column 6, lines 7-19), but does not specifically teach 
stabilizing the mouth, using Linear Discriminant Analysis and designating values for the 
mouth. 

Stork discloses an acoustic and visual speech recognition system comprising: 
stabilizing the mouth image to remove any translational motion of the mouth 

caused by head movement (mouth rested; column 5, lines 37-61) and 

designating values of mouth openness wherein the values range from -1 for the 

mouth being closed, to +1 for the mouth being open (assigned a value plus or minus; 

column 5, lines 37-61), to convey the essential visual information. 
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Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Bakis' process wherein it stabilizes the mouth 
and assigns values to the mouth, as taught by Stork, to result in a different, but 
effective, dynamic visual data vector (column 5, lines 51-61). 

Bakis in view Stork disclose a computer-implemented process for detecting 
speech, but does not specifically teach using a Linear Discriminant Analysis (LDA) 
projection to determine if the mouth in the segmented mouth image is open or closed. 

Liang discloses audiovisual speech recognition using a Linear Discriminant 
Analysis (LDA) projection to determine if the mouth in the segmented mouth image is 
open or closed (column 2, paragraph 0015), to assign pixels in the mouth region to the 
lip and face classes. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Bakis in view of Stork's process, wherein it uses 
a Linear Discriminant Analysis (LDA) projection to determine if the mouth in the 
segmented mouth image is open or closed, as taught by Liang, to find the best 
discrimination between the classes (column 2, paragraph 0015). 

Regarding claim 18, Bakis discloses the computer-readable medium for 
detecting speech, but does not specifically teach wherein said sub-instruction for 
stabilizing the mouth image to remove any translational motion of the mouth caused by 
head movement employs normalized cross correlation. 

Stork discloses an acoustic and visual speech recognition system wherein said 
sub-instruction for stabilizing the mouth image to remove any translational motion of the 
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mouth caused by head movement employs normalized cross correlation (mouth rested; 
column 5, lines 37-61), to convey the essential visual information. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Bakis' process wherein said sub-instruction for 
stabilizing the mouth image to remove any translational motion of the mouth caused by 
head movement employs normalized cross correlation, as taught by Stork, to result in a 
different, but effective, dynamic visual data vector (column 5, lines 51-61). 

Regarding claim 23, Bakis discloses a system wherein the module for extracting 
audio and video features comprises sub-modules to extract the video features 
comprising: 

using a face detector to locate a face in said video training signals (column 2, 
lines 21-47 with column 4, lines 10-47); 

using the geometry of a typical face to estimate the location of a mouth and 
extracting a mouth image (column 6, lines 7-19), but does not specifically teach 
stabilizing the mouth and using Linear Discriminant Analysis. 

Stork discloses an acoustic and visual speech recognition system comprising: 

stabilizing the mouth image to remove any translational motion of the mouth 
caused by head movement (mouth rested; column 5, lines 37-61), to convey the 
essential visual information. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Bakis's system wherein it stabilizes the mouth, 
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as taught by Stork, to result in a different, but effective, dynamic visual data vector 
(column 5, lines 51-61). 

Bakis in view Stork disclose a computer-implemented process for detecting 
speech, but does not specifically teach using a Linear Discriminant Analysis (LDA) 
projection to determine if the mouth in the segmented mouth image is open or closed. 

Liang discloses audiovisual speech recognition using a Linear Discriminant 
Analysis (LDA) projection to determine if the mouth in the segmented mouth image is 
open or closed (column 2, paragraph 0015), to assign pixels in the mouth region to the 
lip and face classes. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Bakis in view of Stork's process, wherein it uses 
a Linear Discriminant Analysis (LDA) projection to determine if the mouth in the 
segmented mouth image is open or closed, as taught by Liang, to find the best 
discrimination between the classes (column 2, paragraph 0015). 

10. Claims 25-27 is rejected under 35 U.S.C. 103(a) as being unpatentable over 
Maali in view of Stork, as applied to claim 24 above, and in further view of Liang and in 
further view of Applicant's Admitted Prior Art (PGPUB 2004/0267521). 

Regarding claim 25, Maali discloses a process wherein the process action of 
computing video features from said video training signals comprises the process actions 
of: 
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using a face detector to locate a face in said video training signals (face detector; 
column 4, lines 1-8) and 

using the geometry of a typical face to estimate the location of a mouth and 
extracting a mouth image (column 11, lines 59-66), but does not specifically teach 
stabilizing the mouth, using Linear Discriminant Analysis and designating values for the 
mouth. 

Stork discloses an acoustic and visual speech recognition system comprising: 

stabilizing the mouth image to remove any translational motion of the mouth 
caused by head movement (mouth rested; column 5, lines 37-61) and 

designating values of mouth openness wherein the values range from -1 for the 
mouth being closed, to +1 for the mouth being open (assigned a value plus or minus; 
column 5, lines 37-61), to convey the essential visual information. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Maali's process wherein it stabilizes the mouth 
and assigns values to the mouth, as taught by Stork, to result in a different, but 
effective, dynamic visual data vector (column 5, lines 51-61). 

Maali in view Stork disclose a computer-implemented process for detecting 
speech, but does not specifically teach using a Linear Discriminant Analysis (LDA) 
projection to determine if the mouth in the segmented mouth image is open or closed. 

Liang discloses audiovisual speech recognition using a Linear Discriminant 
Analysis (LDA) projection to determine if the mouth in the segmented mouth image is 



Application/Control Number: 10/606,061 Page 27 

Art Unit: 2626 

open or closed (column 2, paragraph 0015), to assign pixels in the mouth region to the 
lip and face classes. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Maali in view of Stork's process, wherein it uses 
a Linear Discriminant Analysis (LDA) projection to determine if the mouth in the 
segmented mouth image is open or closed, as taught by Liang, to find the best 
discrimination between the classes (column 2, paragraph 0015). 

Maali in view of Stork and Liang disclose a process for detecting speech, but 
does not specifically teach using a microphone array beam form on each face. 

However, based on Applicant's own admission beamforming is a well known 
technique for improving the sound quality of the speaker (columns 4-5, paragraph 
0055). 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Maali in view of Stork and Liang's process 
wherein it teaches beamforming, as taught by Applicant, to improve the sound quality oif 
the speaker by filtering out sound not coming from the direction of the speaker (columns 
4-5, paragraph 0055). 

Regarding claim 26, Maali in view of Stork and Liang disclose a process for 
detecting speech, but does not specifically teach wherein said audio feature is 
computed using said beam formed audio training data. 
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However, based on Applicant's own admission beamforming is a well known 
technique for improving the sound quality of the speaker (columns 4-5, paragraph 
0055). 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Maali in view of Stork and Liang's process 
wherein it teaches beamforming, as taught by Applicant, to improve the sound quality oif 
the speaker by filtering out sound not coming from the direction of the speaker (columns 
4-5, paragraph 0055). 

Regarding claim 27, Maali discloses a process further comprising the process 
actions of: 

inputting an associated audio and video sequence of a person periodically 
speaking (column 3, lines 51-60), but does not specifically teach using said trained Time 
Delay Neural Network to determine when in said audio and video sequence said person 
is speaking. 

Stork teaches an acoustic and visual speech recognition system using said 
trained Time Delay Neural Network to determine when in said audio and video 
sequence said person is speaking (column 4, lines 46-64 with column 7, lines 22-29), in 
order to accommodate utterances that may be of variable length, as well as somewhat 
unpredictable in the time of utterance onset. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Maali's process wherein it uses said trained 
Time Delay Neural Network to determine when in said audio and video sequence said 
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person is speaking, as taught by Stork, to obtain the most probable utterance (column 
4, lines 47-61). 

1 1 . Claim 32 is rejected under 35 U.S.C. 103(a) as being unpatentable over Maali in 
view of Stork, as applied to claim 28 above, and in further view of Nefian. 

Regarding claim 32, Maali in view of Stork disclose a computer-implemented 
process wherein it detects speech, but does not specifically teach wherein said 
statistical learning engine is a Support Vector Machine. 

Nefian discloses audiovisual continuous speech recognition wherein said 
statistical learning engine is a Support Vector Machine (column 2, paragraph 0016), to 
remove false alarms. 

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time the invention was made to modify Maali in view of Stork's process, wherein said 
statistical learning engine is a Support Vector Machine, as taught by Nefian, to obtain a 
low or minimal correlation with speech (column 2, paragraph 0016). 

Allowable Subject Matter 

12. Claims 8 and 19 are objected to as being dependent upon a rejected base 
claim, but would be allowable if rewritten in independent form including all of the 
limitations of the base claim and any intervening claims. 
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Conclusion 

1 3. THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time 
policy as set forth in 37 CFR 1 .136(a). 

A shortened statutory period for reply to this final action is set to expire THREE 
MONTHS from the mailing date of this action. In the event a first reply is filed within 
TWO MONTHS of the mailing date of this final action and the advisory action is not 
mailed until after the end of the THREE-MONTH shortened statutory period, then the 
shortened statutory period will expire on the date the advisory action is mailed, and any 
extension fee pursuant to 37 CFR 1 .136(a) will be calculated from the mailing date of 
the advisory action. In no event, however, will the statutory period for reply expire later 
than SIX MONTHS from the mailing date of this final action. 

14. Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to Jakieda R. Jackson whose telephone number is 571- 
272-7619. The examiner can normally be reached on Monday-Friday from 5:30am- 
2:00pm. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, David Hudspeth can be reached on 571-272-7843. The fax phone number 
for the organization where this application or proceeding is assigned is 571-273-8300. 
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Information regarding the status of an application may be obtained from the 
Patent Application Information Retrieval (PAIR) system. Status information for 
published applications may be obtained from either Private PAIR or Public PAIR. 
Status information for unpublished applications is available through Private PAIR only. 
For more information about the PAIR system, see http://pair-direct.uspto.gov. Should 
you have questions on access to the Private PAIR system, contact the Electronic 
Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a 
USPTO Customer Service Representative or access to the automated information 
system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. 
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